May 15, 2015

Curiosity

Leads to search

Leads to discovery!

Leads to trapped data?

Leads to quitting?

A web of data

  • In 2008, an estimated 154 million HTML tables (out of the 14.1 billion) contain 'high quality relational data'!!!
  • Hard to quantify how much more exists outside of HTML Tables, but there is an estimate of at least 30 million lists with 'high quality relational data'.
  • A growing number of websites/companies provide programmatic access to their data/services via web APIs (that data typically comes in XML/JSON format).

Before scraping, do some googling!

  • If the resource is well-known, someone else has probably built a tool which provides easier access to the data.
  • I wrote pitchRx which downloads, parses, cleans, and transforms XML data for a specific baseball data resource. Just give it start/end dates.
  • ropensci has a ton of R packages for scientific data sources.
  • The Web Technologies and Services CRAN Task View is a concise overview of various tools for working with data that lives on the web.

A web of messy data!

  • In statistical modeling, we typically assume data is tidy.
  • That is, data appears in a tabular form where
    • 1 row == 1 observation
    • 1 column == 1 variable (observational attribute)
  • Parsing HTML/XML/JSON is easy; but navigating, cleaning, extracting relevants bits, and transforming it into a tidy form is not easy.
  • Knowing a bit about modern tools & web technologies makes it much easier.

Motivating Example

Inspecting elements

Hover to find desired elements

rvest makes scraping HTML pages super easy.

library(rvest)
# First, grab the page source
src <- html("http://en.wikipedia.org/wiki/Table_(information)")
# extract the first element with class of wikitable
node <- html_node(src, ".wikitable")
# convert that <table> element to a data frame
html_table(node)
##   First name   Last name Age
## 1     Bielat    Adamczak  24
## 2  Blaszczyk Kostrzewski  25
## 3 Olatunkboh    Chijiaku  22
## 4   Adrienne    Anthoula  22
## 5     Axelia  Athanasios  22
## 6  Jon-Kabat        Zinn  22
  • html_table() only works on a single <table> element.
  • ".wikitable" is a CSS selector which says: "grab any element with a class equal to wikitable".

rvest is designed to be pipeable.

html("http://en.wikipedia.org/wiki/Table_(information)") %>%
  html_node(".wikitable") %>% html_table()
##   First name   Last name Age
## 1     Bielat    Adamczak  24
## 2  Blaszczyk Kostrzewski  25
## 3 Olatunkboh    Chijiaku  22
## 4   Adrienne    Anthoula  22
## 5     Axelia  Athanasios  22
## 6  Jon-Kabat        Zinn  22
  • Much easier to read/understand!

Your Turn 1

Navigate this page and try the following:

Easy: Grab the table at the bottom of the page (hint: instead of grabbing a node by class with html_node(".class"), you can grab by id with html_node("#id"))

Medium: Grab the actual mean, max, and min temperature.

Hard: Grab the weather history graph and write the figure to disk (download.file() may be helpful here).

(kudos to Hadley Wickham for the example)

What about non-<table> data?

(selectorgadget + rvest) to the rescue!

Extracting links to download reports

Your Turn 2

Nativigate to Wikipedia's list of data structures use SelectorGadget + rvest to do the following:

  1. Obtain a list of Primitive types
  2. Obtain a list of the different Array types

Scraping dynamic web pages

# returns the HTML page source, which is _static_
rvest::html("http://bl.ocks.org/cpsievert/raw/2a9fb8f504cd56e9e8e3/")
## <!DOCTYPE html>
## <html><body>
##     A Simple Table made with JavaScript
##     <p></p>
##     <script>
##       function tableCreate(){
##         var body = document.body,
##           tbl  = document.createElement('table');
## 
##         for(var i = 0; i < 3; i++){
##           var tr = tbl.insertRow();
##           for(var j = 0; j < 3; j++){
##             var td = tr.insertCell();
##             td.appendChild(document.createTextNode("Cell"));
##           }
##         }
##         body.appendChild(tbl);
##       }
##       tableCreate();
##     </script>
## </body></html>
## 

Browser <-> Web Server

rdom can return the DOM as HTML:

# devtools::install_github("cpsievert/rdom")
library(rdom)
rdom("http://bl.ocks.org/cpsievert/raw/2a9fb8f504cd56e9e8e3/") %>%
  html_node("table") %>% html_table()
    X1   X2   X3
1 Cell Cell Cell
2 Cell Cell Cell
3 Cell Cell Cell

Now, without passing the entire DOM between R and phantomjs:

rdom("http://www.techstars.com/companies/stats/", "table") %>%
  html_table()

Web APIs

What about ?

  • If you're scraping a site that has a Web API, people might get angry with you…
  • A Web API is just a set of rules/standards for interacting with a Web server.
  • APIs typically require knowledge of HTTP, but a single HTTP verb (GET) is usually what you want.
  • With httr, working APIs is a breeze:
library(httr)
me <- GET("https://api.github.com/users/cpsievert")
content(me)[c("name", "company")]
## $name
## [1] "Carson"
## 
## $company
## [1] "Iowa State University"

HTTP Verbs

  • A tale of two machines: a web server and client.
  • A server-side dyanmic page requires a machine (the web server) other than the one requesting the page (the client) to process the requests.

What about non-HTML data?

Common data exchange formats

  • HTML is great for sharing content between people, but it isn't great for exchanging data between machines.
  • There are a ton of other ways to exchange data over the web, but by far the most popular ones are XML and JSON.

What is XML?

XML is a markup language that looks very similar to HTML.

<mariokart>
  <driver name="Bowser" occupation="Koopa">
    <vehicle speed="55" weight="25"> Wario Bike </vehicle>
    <vehicle speed="40" weight="67"> Piranha Prowler </vehicle>
  </driver>
  <driver name="Peach" occupation="Princess">
    <vehicle speed="54" weight="29"> Royal Racer </vehicle>
    <vehicle speed="50" weight="34"> Wild Wing </vehicle>
  </driver>
</mariokart>
  • This example shows that XML can (and is) used to store inherently tabular data (thanks Jeroen Ooms for the fun example)
  • What is are the observational units here? How many observations in total?
  • Two units and 6 total observations (4 vehicles and 2 drivers).

XML2R

XML2R is a framework to simplify acquistion of tabular/relational XML.

library(XML2R)
obs <- XML2Obs("http://bit.ly/mario-xml")
table(names(obs))
## 
##          mariokart//driver mariokart//driver//vehicle 
##                          2                          4
  • The main idea of XML2R is to coerce XML into a flat list of observations.
  • The list names track the "observational unit".
  • The list values track the "observational attributes".

obs # named list of observations
## $`mariokart//driver//vehicle`
##      speed weight XML_value     
## [1,] "55"  "25"   " Wario Bike "
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value          
## [1,] "40"  "67"   " Piranha Prowler "
## 
## $`mariokart//driver`
##      name     occupation
## [1,] "Bowser" "Koopa"   
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value      
## [1,] "54"  "29"   " Royal Racer "
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value    
## [1,] "50"  "34"   " Wild Wing "
## 
## $`mariokart//driver`
##      name    occupation
## [1,] "Peach" "Princess"

collapse_obs(obs) # group into table(s) by observational name/unit
## $`mariokart//driver`
##      name     occupation
## [1,] "Bowser" "Koopa"   
## [2,] "Peach"  "Princess"
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value          
## [1,] "55"  "25"   " Wario Bike "     
## [2,] "40"  "67"   " Piranha Prowler "
## [3,] "54"  "29"   " Royal Racer "    
## [4,] "50"  "34"   " Wild Wing "
  • What information have I lost?
  • I can't map vehicles to the drivers!

obs <- add_key(obs, parent = "mariokart//driver", recycle = "name")
collapse_obs(obs)
## $`mariokart//driver`
##      name     occupation
## [1,] "Bowser" "Koopa"   
## [2,] "Peach"  "Princess"
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value           name    
## [1,] "55"  "25"   " Wario Bike "      "Bowser"
## [2,] "40"  "67"   " Piranha Prowler " "Bowser"
## [3,] "54"  "29"   " Royal Racer "     "Peach" 
## [4,] "50"  "34"   " Wild Wing "       "Peach"

Now (if I want) I can merge the tables into a single table…

tabs <- collapse_obs(obs)
merge(tabs[[1]], tabs[[2]], by = "name")
##     name occupation speed weight         XML_value
## 1 Bowser      Koopa    55     25       Wario Bike 
## 2 Bowser      Koopa    40     67  Piranha Prowler 
## 3  Peach   Princess    54     29      Royal Racer 
## 4  Peach   Princess    50     34        Wild Wing

What about JSON?

  • JSON is quickly becoming the format for data on the web.
  • JavaScript Object Notation (JSON) is comprised of two components:
    • arrays => [value1, value2]
    • objects => {"key1": value1, "key2": [value2, value3]}
  • The preferred R package for R <=> JSON conversion has long been RJSONIO
  • However, jsonlite is gaining a lot of momentum/attention.
  • In fact, shiny will soon be moving from RJSONIO to jsonlite.

Package downloads from RStudio's CRAN mirror

Back to Mariokart

[
    {
        "driver": "Bowser",
        "occupation": "Koopa",
        "vehicles": [
            {
                "model": "Wario Bike",
                "speed": 55,
                "weight": 25
            },
            {
                "model": "Piranha Prowler",
                "speed": 40,
                "weight": 67
            }
        ]
    },
    {
        "driver": "Peach",
        "occupation": "Princess",
        "vehicles": [
            {
                "model": "Royal Racer",
                "speed": 54,
                "weight": 29
            },
            {
                "model": "Wild Wing",
                "speed": 50,
                "weight": 34
            }
        ]
    }
]

library(jsonlite)
mario <- fromJSON("http://bit.ly/mario-json")
str(mario) # nested data.frames?!? 
## 'data.frame':    2 obs. of  3 variables:
##  $ driver    : chr  "Bowser" "Peach"
##  $ occupation: chr  "Koopa" "Princess"
##  $ vehicles  :List of 2
##   ..$ :'data.frame': 2 obs. of  3 variables:
##   .. ..$ model : chr  "Wario Bike" "Piranha Prowler"
##   .. ..$ speed : int  55 40
##   .. ..$ weight: int  25 67
##   ..$ :'data.frame': 2 obs. of  3 variables:
##   .. ..$ model : chr  "Royal Racer" "Wild Wing"
##   .. ..$ speed : int  54 50
##   .. ..$ weight: int  29 34

mario$driver
## [1] "Bowser" "Peach"
mario$vehicles
## [[1]]
##             model speed weight
## 1      Wario Bike    55     25
## 2 Piranha Prowler    40     67
## 
## [[2]]
##         model speed weight
## 1 Royal Racer    54     29
## 2   Wild Wing    50     34

How do we get two tables (with a common id) like the XML example?

# this mapply statement is essentially equivalent to add_key
vehicles <- mapply(function(x, y) cbind(x, driver = y), 
                   mario$vehicles, mario$driver, SIMPLIFY = FALSE)
Reduce(rbind, vehicles)
##             model speed weight driver
## 1      Wario Bike    55     25 Bowser
## 2 Piranha Prowler    40     67 Bowser
## 3     Royal Racer    54     29  Peach
## 4       Wild Wing    50     34  Peach
mario[!grepl("vehicle", names(mario))]
##   driver occupation
## 1 Bowser      Koopa
## 2  Peach   Princess

Your Turn 1 Solution

library(rvest)
src <- html("http://www.wunderground.com/history/airport/KVAY/2015/2/17/DailyHistory.html?req_city=Cherry+Hill&req_state=NJ&req_statename=New+Jersey&reqdb.zip=08002&reqdb.magic=1&reqdb.wmo=99999&MR=1")
# easy solution
tab1 <- src %>% html_node("#obsTable") %>% html_table()
# medium solution
tab2 <- src %>% html_node("#historyTable") %>% html_table() %>% .[2:4, "Actual"]
# hard solution
link <- src %>% html_node("#history-graph-image img") %>% html_attr("src")
download.file(paste0("http://www.wunderground.com", link), "fig.png")

Your Turn 2 Solution

src <- html("http://en.wikipedia.org/wiki/List_of_data_structures")
src %>% html_nodes("ul:nth-child(7) a") %>% html_text()
## [1] "Boolean"         "Character"       "Floating-point"  "real number"    
## [5] "Double"          "Integer"         "Enumerated type"
src %>% html_nodes("ul:nth-child(19) a") %>% html_text()
##  [1] "Array"                 "Bit array"            
##  [3] "Bit field"             "Bitboard"             
##  [5] "Bitmap"                "Circular buffer"      
##  [7] "Control table"         "Image"                
##  [9] "Dynamic array"         "Gap buffer"           
## [11] "Hashed array tree"     "Heightmap"            
## [13] "Lookup table"          "Matrix"               
## [15] "Parallel array"        "Sorted array"         
## [17] "Sparse array"          "Sparse matrix"        
## [19] "Iliffe vector"         "Variable-length array"